BNLL CEWL Data Wrangling

Savannah Weaver

Packages

Background and Goals

These CEWL (cutaneous evaporative water loss) data were measured in 3-5 technical replicates on the mid-dorsum of Blunt-nosed Leopard Lizards (Gambelia sila) between April and July 2021. In this R script, I check the distribution of replicates, omit outliers, and average the remaining replicates. The final values will be more precise and accurate estimates of each lizard's true CEWL, and they will be used in the analyses R script file. Please refer to doi: for the published scientific paper and full details.

Load Data

  1. Compile a list of the filenames I need to read in.
# make a list of file names of all data to load in
filenames <- list.files(path = "data/CEWL")
  2. Make a function that reads in the data from each csv and names and organizes the columns correctly.
read_CEWL_file <- function(filename) {
  
  dat <- read.csv(file.path("data/CEWL", filename),
                  na.strings=c("","NA"),
                # each csv has headers
                header = TRUE
                ) %>%
    # select only the relevant values
    dplyr::select(date = Date, 
                  time = Time, 
                  status = Status,
                  ID_rep_no = Comments,
                  CEWL_g_m2h = 'TEWL..g..m2h..', 
                  msmt_temp_C = 'AmbT..C.', 
                  msmt_RH_percent = 'AmbRH....'
                  ) 
  
  # return the dataframe for that single csv file
  dat
}
  3. Apply the function to all of the compiled filenames, then bind all of the resulting dataframes into one. This will print warnings that header and col.names are of different lengths, because the csvs have extra notes columns that are read in and then dropped. Afterward, extract the individual ID and replicate number from the comments, filter out failed measurements, and properly format data classes.
# apply function to get data from all csvs
all_CEWL_data <- lapply(filenames, read_CEWL_file) %>%
  # paste all data files together into one df by row
  reduce(rbind) %>%
    # extract individual_ID and replicate number
    dplyr::mutate(ID_rep_no = as.character(ID_rep_no),
                  ID_len = as.factor(nchar(ID_rep_no)),
                  
                  individual_ID = as.factor(case_when(
                    ID_len == 7 ~ as.character(paste(substr(ID_rep_no, 1, 1),
                                             substr(ID_rep_no, 3, 5),
                                             sep = "-")),
                    ID_len == 6 & substr(ID_rep_no, 1, 1) == "W" 
                        ~ as.character(paste(substr(ID_rep_no, 1, 1),
                                             substr(ID_rep_no, 2, 4),
                                             sep = "-")),
                    ID_len == 6 & substr(ID_rep_no, 1, 1) %in% c("M", "F") 
                        ~ as.character(paste(substr(ID_rep_no, 1, 1),
                                             substr(ID_rep_no, 3, 4),
                                             sep = "-")),
                    ID_len == 5 ~ as.character(paste(substr(ID_rep_no, 1, 1),
                                             substr(ID_rep_no, 2, 3),
                                             sep = "-")))
                    ),
                  replicate_no = as.factor(case_when(
                    ID_len == 7 ~ as.character(substr(ID_rep_no, 7, 7)),
                    ID_len == 6 ~ as.character(substr(ID_rep_no, 6, 6)),
                    ID_len == 5 ~ as.character(substr(ID_rep_no, 5, 5))
                    ))) %>%
  # filter out failed measurements
  dplyr::filter(status == "Normal") %>%
  # correctly format data classes
  mutate(date = as.Date(date, format = "%m/%d/%y"),
         time = as.POSIXct(time, format = "%H:%M"),
         status = as.factor(status)
         )

summary(all_CEWL_data)
##       date                 time                           status   
##  Min.   :2021-04-23   Min.   :2023-11-09 01:00:00.00   Normal:456  
##  1st Qu.:2021-04-24   1st Qu.:2023-11-09 02:24:45.00               
##  Median :2021-05-07   Median :2023-11-09 03:46:00.00               
##  Mean   :2021-05-12   Mean   :2023-11-09 04:22:43.42               
##  3rd Qu.:2021-05-08   3rd Qu.:2023-11-09 05:02:15.00               
##  Max.   :2021-07-14   Max.   :2023-11-09 12:59:00.00               
##                                                                    
##   ID_rep_no           CEWL_g_m2h     msmt_temp_C    msmt_RH_percent ID_len 
##  Length:456         Min.   :-1.32   Min.   :18.90   Min.   :11.50   5:122  
##  Class :character   1st Qu.: 7.74   1st Qu.:28.50   1st Qu.:14.50   6:244  
##  Mode  :character   Median :10.21   Median :30.30   Median :16.95   7: 90  
##                     Mean   :10.62   Mean   :29.55   Mean   :21.36          
##                     3rd Qu.:12.89   3rd Qu.:31.50   3rd Qu.:24.30          
##                     Max.   :65.31   Max.   :33.70   Max.   :41.60          
##                                                                            
##  individual_ID replicate_no
##  F-12   : 13   1:117       
##  M-10   : 13   2:118       
##  M-11   : 13   3:118       
##  M-19   : 13   4: 52       
##  M-20   : 13   5: 51       
##  M-09   : 12               
##  (Other):379
unique(all_CEWL_data$individual_ID)
##  [1] F-01  F-10  F-11  F-12  F-13  F-14  F-15  F-16  F-17  F-18  F-19  F-02 
## [13] F-03  F-04  F-05  F-06  F-07  F-08  F-09  M-01  M-10  M-11  M-12  M-13 
## [25] M-14  M-15  M-16  M-17  M-18  M-19  M-02  M-20  M-03  M-04  M-05  M-06 
## [37] M-07  M-08  M-09  W-010 W-011 W-012 W-013 W-014 W-015 W-016 W-017 W-018
## [49] W-019 W-002 W-020 W-021 W-022 W-023 W-024 W-025 W-026 W-027 W-028 W-003
## [61] W-004 W-005 W-006 W-007 W-008 W-009 W-029 W-030 W-031 W-001 W-032 F-08A
## [73] W-034 W-035 W-033 W-036 W-037 W-038 M-03A W-039
## 80 Levels: F-01 F-02 F-03 F-04 F-05 F-06 F-07 F-08 F-08A F-09 F-10 ... W-039
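The `case_when()` parsing above can be sanity-checked on a few representative comment strings. The strings below are hypothetical examples of the four formats the code handles (7 characters, 6 characters starting with "W", 6 characters starting with "M"/"F", and 5 characters):

```r
# hypothetical comment strings in the four formats handled above
ids <- c("M-03A-1", "W013_1", "F-11_1", "M01_1")

# same substr() logic as the case_when() in the pipeline
parse_id <- function(x) {
  n <- nchar(x)
  if (n == 7) {
    paste(substr(x, 1, 1), substr(x, 3, 5), sep = "-")
  } else if (n == 6 && substr(x, 1, 1) == "W") {
    paste(substr(x, 1, 1), substr(x, 2, 4), sep = "-")
  } else if (n == 6) {
    paste(substr(x, 1, 1), substr(x, 3, 4), sep = "-")
  } else {
    paste(substr(x, 1, 1), substr(x, 2, 3), sep = "-")
  }
}

sapply(ids, parse_id)
# "M-03A" "W-013" "F-11" "M-01"
```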

Check Data

Each lizard measured on each date should have 3-5 technical replicates, and those measurements should have been taken around the same time.

all_CEWL_data %>%
                group_by(individual_ID, date) %>%
                summarise(n = n(),
                          time_range = max(time) - min(time)) %>% 
                arrange(n)
## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
## # A tibble: 118 × 4
## # Groups:   individual_ID [80]
##    individual_ID date           n time_range
##    <fct>         <date>     <int> <drtn>    
##  1 F-01          2021-04-23     3 120 secs  
##  2 F-02          2021-04-23     3 120 secs  
##  3 F-03          2021-04-23     3 120 secs  
##  4 F-04          2021-04-23     3  60 secs  
##  5 F-05          2021-04-24     3 120 secs  
##  6 F-06          2021-04-24     3 120 secs  
##  7 F-07          2021-04-24     3  60 secs  
##  8 F-08          2021-04-24     3  60 secs  
##  9 F-09          2021-04-24     3 120 secs  
## 10 F-10          2021-04-24     3 120 secs  
## # … with 108 more rows

The number of measurements per lizard is good: almost always 3 or 5, with only two lizards that got 4 measurements, which is fine. However, M-01 on April 23 and M-03A on July 14 have abnormal time ranges of 43140 seconds (almost 12 h), so we need to check that data.

all_CEWL_data %>% dplyr::filter(individual_ID %in% c("M-01", "M-03A"))
##          date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1  2021-04-23 2023-11-09 12:57:00 Normal     M01_1       0.69        31.0
## 2  2021-04-23 2023-11-09 12:59:00 Normal     M01_2       0.14        30.7
## 3  2021-04-23 2023-11-09 01:00:00 Normal     M01_3       1.12        30.5
## 4  2021-07-14 2023-11-09 12:58:00 Normal   M-03A-1       9.98        27.4
## 5  2021-07-14 2023-11-09 12:59:00 Normal   M-03A-2       9.16        27.8
## 6  2021-07-14 2023-11-09 01:00:00 Normal   M-03A-3      11.05        28.0
## 7  2021-07-14 2023-11-09 01:01:00 Normal   M-03A-4      13.29        28.1
## 8  2021-07-14 2023-11-09 01:02:00 Normal   M-03A-5       8.69        28.4
## 9  2021-07-14 2023-11-09 05:00:00 Normal    M-01-1      13.70        27.4
## 10 2021-07-14 2023-11-09 05:01:00 Normal    M-01-2      10.94        27.2
## 11 2021-07-14 2023-11-09 05:02:00 Normal    M-01-3      11.35        27.0
## 12 2021-07-14 2023-11-09 05:03:00 Normal    M-01-4       9.39        26.8
## 13 2021-07-14 2023-11-09 05:04:00 Normal    M-01-5       8.90        26.6
##    msmt_RH_percent ID_len individual_ID replicate_no
## 1             15.9      5          M-01            1
## 2             16.3      5          M-01            2
## 3             16.7      5          M-01            3
## 4             37.1      7         M-03A            1
## 5             36.8      7         M-03A            2
## 6             37.1      7         M-03A            3
## 7             35.9      7         M-03A            4
## 8             35.2      7         M-03A            5
## 9             39.7      6          M-01            1
## 10            39.6      6          M-01            2
## 11            39.5      6          M-01            3
## 12            39.6      6          M-01            4
## 13            39.6      6          M-01            5

Aha, the problem is that the times aren't recorded with an AM/PM marker, so 1 pm is parsed as 1 am. The measurements in question spanned noon to 1 pm, so after reformatting they appear to span 1 am to 12 pm. The data themselves are fine as-is, and nothing is amiss.
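A minimal sketch of the parsing quirk: with the `%H:%M` format used above, times logged without an AM/PM marker are read on a 24-hour clock, so a 1:00 pm entry becomes 01:00 (tz set to UTC here just to make the example deterministic):

```r
# "12:59" and "1:00" were logged one minute apart (12:59 pm, 1:00 pm),
# but %H:%M reads "1:00" as 1 am, creating an apparent ~12 h gap
t1 <- as.POSIXct("12:59", format = "%H:%M", tz = "UTC")
t2 <- as.POSIXct("1:00",  format = "%H:%M", tz = "UTC")
difftime(t2, t1, units = "secs")
# Time difference of -43140 secs
```

The 43140-second "range" flagged above is exactly this artifact.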

Replicates

Assess Variation

We want the Coefficient of Variation (CV) among our technical replicates to be small. We need to calculate it to identify whether there may be outliers.

CVs <- all_CEWL_data %>%
  group_by(individual_ID, date) %>%
  summarise(mean = mean(CEWL_g_m2h),
            SD = sd(CEWL_g_m2h),
            CV = (SD/mean) *100,
            min = min(CEWL_g_m2h),
            max = max(CEWL_g_m2h),
            range = max - min
            )
## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
summary(CVs)
##  individual_ID      date                 mean              SD         
##  F-12   :  3   Min.   :2021-04-23   Min.   : 0.650   Min.   : 0.1124  
##  M-09   :  3   1st Qu.:2021-04-24   1st Qu.: 8.486   1st Qu.: 1.4849  
##  M-10   :  3   Median :2021-04-24   Median :10.443   Median : 2.0290  
##  M-11   :  3   Mean   :2021-05-08   Mean   :10.823   Mean   : 2.9641  
##  M-19   :  3   3rd Qu.:2021-05-08   3rd Qu.:13.391   3rd Qu.: 3.1195  
##  M-20   :  3   Max.   :2021-07-14   Max.   :31.550   Max.   :29.3242  
##  (Other):100                                                          
##        CV               min              max            range       
##  Min.   :  1.956   Min.   :-1.320   Min.   : 1.12   Min.   : 0.220  
##  1st Qu.: 15.021   1st Qu.: 6.723   1st Qu.:10.21   1st Qu.: 3.130  
##  Median : 20.135   Median : 8.245   Median :13.32   Median : 4.600  
##  Mean   : 28.713   Mean   : 8.159   Mean   :14.36   Mean   : 6.196  
##  3rd Qu.: 35.639   3rd Qu.:10.500   3rd Qu.:16.37   3rd Qu.: 6.772  
##  Max.   :105.713   Max.   :19.640   Max.   :65.31   Max.   :52.900  
## 
hist(CVs$CV)

hist(CVs$range) 

We expect the CV of technical replicates to be < 10-15%, so we must determine whether the CVs > 15% are due to outlier replicates. The range should also generally be within about 5 g/m2h for these measurements. :(
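As a quick illustration of the CV calculation used above, computed on made-up replicate values:

```r
# CV = (SD / mean) * 100; unitless, so comparable across lizards
reps <- c(10.2, 11.0, 9.8)   # hypothetical replicate CEWL values
cv <- sd(reps) / mean(reps) * 100
round(cv, 1)
# 5.9  -- comfortably below the 10-15% threshold
```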

Find Outliers

First, create a function to look at the replicates for each individual on each day. For each iteration, it makes a boxplot and extracts any outliers, compiling a dataframe of outliers that I want to exclude from the final dataset. Printing the boxplots alongside the compiled dataframe lets me check the extracted outliers against the plots, to ensure confidence in the outliers quantified.
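The outlier rule here is the standard boxplot one: `boxplot.stats()$out` returns any points beyond 1.5 times the hinge spread (approximately the IQR) from the hinges, i.e. the points a default boxplot would draw individually. A small sketch with made-up values:

```r
# 40 lies above the upper fence (upper hinge + 1.5 * hinge spread),
# so it is flagged; the remaining values are not
boxplot.stats(c(9, 10, 11, 12, 40))$out
# [1] 40
```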

# write function to find outliers for each individual on each date
find_outliers <- function(df) {
  
  # initiate dataframe to compile outliers (and optionally a list of plots)
  outliers <- data.frame()
  #boxplots <- list()

  # loop over every individual in df
  for(indiv_ch in unique(df$individual_ID)) {
    
    # select data for only the individual of interest
    df_sub <- df %>%
      dplyr::filter(individual_ID == (indiv_ch))
    
    # make a boxplot
    df_sub %>%
      ggplot(.) +
      geom_boxplot(aes(x = as.factor(date),
                       y = CEWL_g_m2h,
                       fill = as.factor(date))) +
      ggtitle(paste("Individual", indiv_ch)) +
      theme_classic() -> plot
    
    # print/save
    print(plot)
    #boxplots[[indiv_ch]] <- plot
    
    # extract outliers
    outs <- df_sub %>%
      group_by(individual_ID, date) %>%
      summarise(outs = boxplot.stats(CEWL_g_m2h)$out)
    
    # add to running dataframe of outliers
    outliers <- outliers %>%
      rbind(outs)
  }
  #return(boxplots)
  return(outliers)
}

Now apply the function to the data:

# note: par() does not affect ggplot2 plots, so each boxplot prints on its own
par(mfrow = c(71, 2))
outliers_found <- find_outliers(all_CEWL_data)
## `summarise()` has grouped output by 'individual_ID', 'date'. You can override
## using the `.groups` argument.
outliers_found
## # A tibble: 24 × 3
## # Groups:   individual_ID, date [18]
##    individual_ID date        outs
##    <fct>         <date>     <dbl>
##  1 F-13          2021-05-08 41.9 
##  2 F-06          2021-05-08 13.8 
##  3 M-10          2021-05-07  7.17
##  4 M-10          2021-05-07  2.79
##  5 M-11          2021-05-07 13.5 
##  6 M-11          2021-07-14 11.5 
##  7 M-13          2021-07-14 17.1 
##  8 M-13          2021-07-14 11.3 
##  9 M-19          2021-07-14 17.9 
## 10 M-20          2021-07-14 11.3 
## # … with 14 more rows
par(mfrow = c(1, 1))

Based on the plots, the dataframe of outliers I compiled is correct. (yay!)

Remove Outliers

Now I will create a second version of the same function; instead of compiling the outliers, it omits them from the dataset.

# write function to find and exclude outliers
omit_outliers <- function(df) {
  
  # initiate dataframe to compile cleaned data
  cleaned <- data.frame()

  # loop over every individual in df
  for(indiv_ch in unique(df$individual_ID)) {
  for(indiv_ch in unique(df$individual_ID)) {
    
    # select data for only the individual of interest
    df_sub <- df %>%
      dplyr::filter(individual_ID == (indiv_ch))
    
    # extract outliers
    outs <- df_sub %>%
      group_by(individual_ID, date) %>%
      summarise(outs = boxplot.stats(CEWL_g_m2h)$out)
    
    # filter outliers from this individual's data subset
    # (%nin% is "not in", from the Hmisc package)
    filtered <- df_sub %>%
      dplyr::filter(CEWL_g_m2h %nin% outs$outs)
    
    # add to running dataframe of cleaned data
    cleaned <- cleaned %>%
      rbind(filtered)
  }
  return(cleaned)
}

Apply the function to the data and check that the cleaned rows plus the omitted outliers account for every original row:

outliers_omitted <- omit_outliers(all_CEWL_data)
nrow(all_CEWL_data) == nrow(outliers_omitted) + nrow(outliers_found)
## [1] TRUE
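The `%nin%` ("not in") operator used in `omit_outliers()` comes from the Hmisc package; if Hmisc is not loaded, an equivalent operator can be defined in one line of base R:

```r
# base-R equivalent of Hmisc's %nin%
`%nin%` <- function(x, table) !(x %in% table)

# keep the first two values, drop the third (a flagged outlier)
c(1.12, 9.98, 41.9) %nin% c(41.9)
# [1]  TRUE  TRUE FALSE
```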

Re-Assess Variation

new_CVs <- outliers_omitted %>%
  group_by(individual_ID, date) %>%
  summarise(mean = mean(CEWL_g_m2h),
            SD = sd(CEWL_g_m2h),
            CV = (SD/mean) *100,
            min = min(CEWL_g_m2h),
            max = max(CEWL_g_m2h),
            range = max - min)
## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
summary(new_CVs)
##  individual_ID      date                 mean              SD          
##  F-12   :  3   Min.   :2021-04-23   Min.   : 0.650   Min.   : 0.05508  
##  M-09   :  3   1st Qu.:2021-04-24   1st Qu.: 8.486   1st Qu.: 1.21719  
##  M-10   :  3   Median :2021-04-24   Median :10.421   Median : 1.85776  
##  M-11   :  3   Mean   :2021-05-08   Mean   :10.682   Mean   : 2.65196  
##  M-19   :  3   3rd Qu.:2021-05-08   3rd Qu.:13.239   3rd Qu.: 2.88268  
##  M-20   :  3   Max.   :2021-07-14   Max.   :31.550   Max.   :29.32424  
##  (Other):100                                                           
##        CV               min              max             range       
##  Min.   :  1.032   Min.   :-1.320   Min.   : 1.120   Min.   : 0.100  
##  1st Qu.: 13.433   1st Qu.: 6.723   1st Qu.: 9.985   1st Qu.: 2.518  
##  Median : 19.265   Median : 8.420   Median :12.545   Median : 4.060  
##  Mean   : 25.543   Mean   : 8.287   Mean   :13.639   Mean   : 5.352  
##  3rd Qu.: 33.436   3rd Qu.:10.502   3rd Qu.:15.520   3rd Qu.: 6.197  
##  Max.   :105.713   Max.   :19.640   Max.   :65.310   Max.   :52.900  
## 
hist(new_CVs$CV)

hist(CVs$CV)

hist(new_CVs$range) 

hist(CVs$range) 

This definitely improved things, but unfortunately, the CVs are still right-skewed. I suspect outliers are harder to detect in replicate groups with only 3 replicates.
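That limitation can be seen directly: with only 3 replicates, the hinges sit close to the extremes, so the boxplot fences are wide relative to the data and `boxplot.stats()` rarely flags anything (made-up values):

```r
# 30 is clearly discordant with 10 and 11, but with n = 3 the hinge
# spread is large relative to the data, so no outlier is flagged
boxplot.stats(c(10, 11, 30))$out
# numeric(0)
```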

Check the info for lizards with super high value ranges:

new_CVs %>%
  dplyr::filter(range > 10)
## # A tibble: 14 × 8
## # Groups:   individual_ID [14]
##    individual_ID date        mean    SD    CV   min   max range
##    <fct>         <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 F-02          2021-04-23 19.2   6.49  33.8 15.2   26.7  11.4
##  2 F-05          2021-04-24 31.6  29.3   92.9 12.4   65.3  52.9
##  3 F-06          2021-04-24 18.7   8.92  47.6 12.3   28.9  16.6
##  4 F-11          2021-05-08 14.4   5.34  37.1  7.76  20.3  12.6
##  5 F-14          2021-04-24 15.5   9.35  60.3  9.3   26.2  17.0
##  6 F-17          2021-05-08 13.4   4.56  34.0  8.78  19.0  10.2
##  7 M-10          2021-04-24 24.2   5.92  24.4 19.2   30.8  11.5
##  8 W-013         2021-04-24 16.6   5.91  35.7 12.6   23.4  10.7
##  9 W-016         2021-04-24 13.9   5.35  38.5  9.48  19.8  10.3
## 10 W-017         2021-04-24 16.7   6.62  39.6 12.0   24.3  12.3
## 11 W-024         2021-04-24 16.7   5.34  31.9 11.4   22.0  10.7
## 12 W-026         2021-04-25 22.3  17.7   79.5 10.0   42.6  32.5
## 13 W-031         2021-05-07  4.56  4.49  98.4 -1.32  10.4  11.8
## 14 W-037         2021-05-08 14.7   4.67  31.8  7.48  19.8  12.3

Look at the original CEWL measurements for those lizards:

outliers_omitted %>%
  dplyr::filter(individual_ID == "F-02" & date == "2021-04-23")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-23 2023-11-09 01:52:00 Normal     F02_1      26.69        33.1
## 2 2021-04-23 2023-11-09 01:53:00 Normal     F02_2      15.25        33.7
## 3 2021-04-23 2023-11-09 01:54:00 Normal     F02_3      15.65        33.4
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            16.9      5          F-02            1
## 2            17.1      5          F-02            2
## 3            16.4      5          F-02            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "F-05" & date == "2021-04-24")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 12:19:00 Normal     F05_2      65.31        31.0
## 2 2021-04-24 2023-11-09 12:20:00 Normal     F05_3      16.93        29.1
## 3 2021-04-24 2023-11-09 12:21:00 Normal     F05_4      12.41        29.2
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            21.0      5          F-05            2
## 2            20.6      5          F-05            3
## 3            20.3      5          F-05            4
outliers_omitted %>%
  dplyr::filter(individual_ID == "F-06" & date == "2021-04-24")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 12:51:00 Normal     F06_1      28.92        30.6
## 2 2021-04-24 2023-11-09 12:52:00 Normal     F06_2      14.97        30.4
## 3 2021-04-24 2023-11-09 12:53:00 Normal     F06_3      12.31        30.1
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            19.6      5          F-06            1
## 2            18.7      5          F-06            2
## 3            18.8      5          F-06            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "F-11" & date == "2021-05-08") # fine
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-05-08 2023-11-09 02:11:00 Normal    F-11_1      19.35        31.2
## 2 2021-05-08 2023-11-09 02:12:00 Normal    F-11_2      20.32        31.5
## 3 2021-05-08 2023-11-09 02:13:00 Normal    F-11_3      12.95        32.3
## 4 2021-05-08 2023-11-09 02:13:00 Normal    F-11_4      11.51        33.1
## 5 2021-05-08 2023-11-09 02:14:00 Normal    F-11_5       7.76        32.1
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            13.7      6          F-11            1
## 2            15.7      6          F-11            2
## 3            16.6      6          F-11            3
## 4            17.6      6          F-11            4
## 5            13.8      6          F-11            5
outliers_omitted %>%
  dplyr::filter(individual_ID == "F-14" & date == "2021-04-24")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 12:02:00 Normal     F14_1      26.25        26.5
## 2 2021-04-24 2023-11-09 12:03:00 Normal     F14_2       9.30        28.1
## 3 2021-04-24 2023-11-09 12:05:00 Normal     F14_3      10.95        28.3
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            23.0      5          F-14            1
## 2            24.1      5          F-14            2
## 3            22.1      5          F-14            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "F-17" & date == "2021-05-08") # fine
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-05-08 2023-11-09 01:53:00 Normal    F-17_1      18.96        31.1
## 2 2021-05-08 2023-11-09 01:53:00 Normal    F-17_2      17.65        31.1
## 3 2021-05-08 2023-11-09 01:54:00 Normal    F-17_3      10.69        31.2
## 4 2021-05-08 2023-11-09 01:54:00 Normal    F-17_4       8.78        30.7
## 5 2021-05-08 2023-11-09 01:55:00 Normal    F-17_5      11.04        30.2
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            15.2      6          F-17            1
## 2            14.3      6          F-17            2
## 3            14.1      6          F-17            3
## 4            14.1      6          F-17            4
## 5            14.5      6          F-17            5
outliers_omitted %>%
  dplyr::filter(individual_ID == "M-10" & date == "2021-04-24")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 12:32:00 Normal     M10_1      30.79        29.1
## 2 2021-04-24 2023-11-09 12:33:00 Normal     M10_2      19.25        28.7
## 3 2021-04-24 2023-11-09 12:34:00 Normal     M10_3      22.71        29.2
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            19.8      5          M-10            1
## 2            20.3      5          M-10            2
## 3            20.3      5          M-10            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "W-013" & date == "2021-04-24")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 03:58:00 Normal    W013_1      23.36        31.5
## 2 2021-04-24 2023-11-09 03:58:00 Normal    W013_2      13.69        31.4
## 3 2021-04-24 2023-11-09 03:59:00 Normal    W013_3      12.63        31.1
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            15.6      6         W-013            1
## 2            15.2      6         W-013            2
## 3            15.1      6         W-013            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "W-016" & date == "2021-04-24") # fine
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 04:27:00 Normal    W016_1      19.83        28.6
## 2 2021-04-24 2023-11-09 04:28:00 Normal    W016_2      12.33        28.8
## 3 2021-04-24 2023-11-09 04:29:00 Normal    W016_3       9.48        28.9
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            15.7      6         W-016            1
## 2            15.2      6         W-016            2
## 3            15.3      6         W-016            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "W-017" & date == "2021-04-24")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 03:29:00 Normal    W017_1      24.31        30.9
## 2 2021-04-24 2023-11-09 03:30:00 Normal    W017_2      13.88        30.7
## 3 2021-04-24 2023-11-09 03:31:00 Normal    W017_3      12.02        30.6
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            17.5      6         W-017            1
## 2            15.6      6         W-017            2
## 3            15.3      6         W-017            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "W-024" & date == "2021-04-24") # fine
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-24 2023-11-09 02:56:00 Normal    W024_1      22.02        31.1
## 2 2021-04-24 2023-11-09 02:57:00 Normal    W024_2      11.35        31.6
## 3 2021-04-24 2023-11-09 02:58:00 Normal    W024_3      16.78        31.8
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            20.1      6         W-024            1
## 2            18.8      6         W-024            2
## 3            18.6      6         W-024            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "W-026" & date == "2021-04-25")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-04-25 2023-11-09 02:36:00 Normal    W026_1      42.56        19.1
## 2 2021-04-25 2023-11-09 02:37:00 Normal    W026_2      14.20        19.1
## 3 2021-04-25 2023-11-09 02:37:00 Normal    W026_3      10.04        18.9
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            36.5      6         W-026            1
## 2            35.6      6         W-026            2
## 3            36.0      6         W-026            3
outliers_omitted %>%
  dplyr::filter(individual_ID == "W-031" & date == "2021-05-07") # def need to remove negative value, yikes
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-05-07 2023-11-09 04:54:00 Normal   W-031_1       5.49        30.1
## 2 2021-05-07 2023-11-09 04:55:00 Normal   W-031_2       1.84        30.3
## 3 2021-05-07 2023-11-09 04:56:00 Normal   W-031_3      -1.32        29.7
## 4 2021-05-07 2023-11-09 04:57:00 Normal   W-031_4      10.43        29.7
## 5 2021-05-07 2023-11-09 04:58:00 Normal   W-031_5       6.35        29.8
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            23.3      7         W-031            1
## 2            23.0      7         W-031            2
## 3            23.4      7         W-031            3
## 4            23.7      7         W-031            4
## 5            23.6      7         W-031            5
outliers_omitted %>%
  dplyr::filter(individual_ID == "W-037" & date == "2021-05-08")
##         date                time status ID_rep_no CEWL_g_m2h msmt_temp_C
## 1 2021-05-08 2023-11-09 04:20:00 Normal   W-037_1      19.75        31.4
## 2 2021-05-08 2023-11-09 04:21:00 Normal   W-037_2      17.11        32.0
## 3 2021-05-08 2023-11-09 04:22:00 Normal   W-037_3      15.94        31.6
## 4 2021-05-08 2023-11-09 04:23:00 Normal   W-037_4      13.18        31.1
## 5 2021-05-08 2023-11-09 04:24:00 Normal   W-037_5       7.48        31.1
##   msmt_RH_percent ID_len individual_ID replicate_no
## 1            13.4      7         W-037            1
## 2            14.7      7         W-037            2
## 3            12.9      7         W-037            3
## 4            12.8      7         W-037            4
## 5            12.8      7         W-037            5
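
The inspection calls above all follow the same filter pattern; a small helper (a sketch, with the hypothetical name show_reps) would keep each check to one line:

```r
# sketch of a helper for the repeated inspection filters above;
# the function name show_reps is hypothetical
show_reps <- function(df, id, msmt_date) {
  df %>%
    dplyr::filter(individual_ID == id & date == msmt_date)
}

# e.g., show_reps(outliers_omitted, "W-031", "2021-05-07")
```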

Remove Extreme Values

# drop the extreme replicates flagged during the inspection above
evs_omitted <- outliers_omitted %>%
  dplyr::filter(!(individual_ID == "F-02" & CEWL_g_m2h == 26.69)) %>%
  dplyr::filter(!(individual_ID == "F-05" & CEWL_g_m2h == 65.31)) %>%
  dplyr::filter(!(individual_ID == "F-06" & CEWL_g_m2h == 28.92)) %>%
  dplyr::filter(!(individual_ID == "F-14" & CEWL_g_m2h == 26.25)) %>%
  dplyr::filter(!(individual_ID == "M-10" & CEWL_g_m2h == 30.79)) %>%
  dplyr::filter(!(individual_ID == "W-013" & CEWL_g_m2h == 23.36)) %>%
  dplyr::filter(!(individual_ID == "W-017" & CEWL_g_m2h == 24.31)) %>%
  dplyr::filter(!(individual_ID == "W-026" & CEWL_g_m2h == 42.56)) %>%
  dplyr::filter(!(individual_ID == "W-031" & CEWL_g_m2h == -1.32)) %>%
  dplyr::filter(!(individual_ID == "W-037" & CEWL_g_m2h == 7.48))
nrow(outliers_omitted) == nrow(evs_omitted) + 10
## [1] TRUE
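
The chain of negated filters above could also be written as a single anti_join against a lookup table of the flagged (individual_ID, CEWL_g_m2h) pairs; a sketch, where the flagged_evs table name is hypothetical and the values are copied from the filters above:

```r
# hypothetical lookup table of the extreme replicates flagged above
flagged_evs <- tibble::tribble(
  ~individual_ID, ~CEWL_g_m2h,
  "F-02",  26.69,
  "F-05",  65.31,
  "F-06",  28.92,
  "F-14",  26.25,
  "M-10",  30.79,
  "W-013", 23.36,
  "W-017", 24.31,
  "W-026", 42.56,
  "W-031", -1.32,
  "W-037",  7.48
)

# keep only rows that do not match a flagged (ID, value) pair
evs_omitted <- outliers_omitted %>%
  dplyr::anti_join(flagged_evs, by = c("individual_ID", "CEWL_g_m2h"))
```

Joining on a floating-point column works here only because the values are typed verbatim from the printout; keying on (individual_ID, date, replicate_no) would be more robust.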

Re-Assess Variation

new_new_CVs <- evs_omitted %>%
  group_by(individual_ID, date) %>%
  summarise(mean = mean(CEWL_g_m2h),
            SD = sd(CEWL_g_m2h),
            CV = (SD/mean) *100,
            min = min(CEWL_g_m2h),
            max = max(CEWL_g_m2h),
            range = max - min)
## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
summary(new_new_CVs)
##  individual_ID      date                 mean              SD         
##  F-12   :  3   Min.   :2021-04-23   Min.   : 0.650   Min.   :0.05508  
##  M-09   :  3   1st Qu.:2021-04-24   1st Qu.: 8.486   1st Qu.:1.11937  
##  M-10   :  3   Median :2021-04-24   Median :10.381   Median :1.80387  
##  M-11   :  3   Mean   :2021-05-08   Mean   :10.272   Mean   :1.98105  
##  M-19   :  3   3rd Qu.:2021-05-08   3rd Qu.:12.881   3rd Qu.:2.59019  
##  M-20   :  3   Max.   :2021-07-14   Max.   :21.673   Max.   :5.34626  
##  (Other):100                                                          
##        CV               min              max             range       
##  Min.   :  1.032   Min.   : 0.140   Min.   : 1.120   Min.   : 0.100  
##  1st Qu.: 11.976   1st Qu.: 6.723   1st Qu.: 9.985   1st Qu.: 2.118  
##  Median : 17.945   Median : 8.525   Median :12.485   Median : 3.695  
##  Mean   : 22.421   Mean   : 8.362   Mean   :12.408   Mean   : 4.047  
##  3rd Qu.: 29.047   3rd Qu.:10.527   3rd Qu.:14.865   3rd Qu.: 5.308  
##  Max.   :105.713   Max.   :19.640   Max.   :24.960   Max.   :12.560  
## 
hist(new_CVs$CV)        # CVs before removing extreme values

hist(new_new_CVs$CV)    # CVs after removing extreme values

hist(new_CVs$range)     # ranges before removing extreme values

hist(new_new_CVs$range) # ranges after removing extreme values
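
The before/after histograms are easier to compare side by side on one device with a temporary par() change (a sketch):

```r
# arrange the four comparison histograms in a 2x2 grid
op <- par(mfrow = c(2, 2))
hist(new_CVs$CV,        main = "CV, before EV removal")
hist(new_new_CVs$CV,    main = "CV, after EV removal")
hist(new_CVs$range,     main = "Range, before EV removal")
hist(new_new_CVs$range, main = "Range, after EV removal")
par(op)  # restore previous graphics settings
```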

Another big improvement. :)

Average Replicates (outliers removed)

CEWL_avgs <- evs_omitted %>%
  group_by(date, individual_ID) %>%
  summarise(CEWL_g_m2h_mean = mean(CEWL_g_m2h),
            CEWL_SD = sd(CEWL_g_m2h),
            CEWL_CV = (CEWL_SD/CEWL_g_m2h_mean)*100,
            msmt_temp_C = mean(msmt_temp_C),
            msmt_RH_percent = mean(msmt_RH_percent)) 
## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
# tech rep stats
mean(CEWL_avgs$CEWL_CV)
## [1] 22.42122
CEWL_final <- CEWL_avgs %>%
  dplyr::select(date, individual_ID,
                CEWL_g_m2h = CEWL_g_m2h_mean,
                msmt_temp_C, msmt_RH_percent) %>% 
  # calculate VPD based on Campbell & Norman 1998
  mutate(e_s_kPa = 0.611 * exp((17.502*msmt_temp_C)/(msmt_temp_C + 240.97)),
         msmt_VPD_kPa = e_s_kPa*(1 - (msmt_RH_percent/100))
         )
head(CEWL_final)
## # A tibble: 6 × 7
## # Groups:   date [1]
##   date       individual_ID CEWL_g_m2h msmt_temp_C msmt_RH_perc…¹ e_s_kPa msmt_…²
##   <date>     <fct>              <dbl>       <dbl>          <dbl>   <dbl>   <dbl>
## 1 2021-04-23 F-01               10.4         31.7           12.2    4.68    4.11
## 2 2021-04-23 F-02               15.4         33.6           16.8    5.19    4.32
## 3 2021-04-23 F-03                8.40        32.0           14.2    4.76    4.09
## 4 2021-04-23 F-04                9.21        25.0           26.0    3.17    2.35
## 5 2021-04-23 F-11                8.96        31.9           14.0    4.72    4.06
## 6 2021-04-23 F-12                1.49        32.2           13.5    4.82    4.17
## # … with abbreviated variable names ¹​msmt_RH_percent, ²​msmt_VPD_kPa
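
As a spot check of the VPD arithmetic, plugging row 1 above (31.7 C, 12.2% RH) into the Campbell & Norman (1998) formula should reproduce e_s of about 4.68 kPa and VPD of about 4.11 kPa (small differences are expected because the displayed tibble values are rounded):

```r
# spot check of the saturation vapor pressure and VPD calculation
# using row 1 of CEWL_final (31.7 C, 12.2% RH)
T_C <- 31.7
RH  <- 12.2
e_s <- 0.611 * exp((17.502 * T_C) / (T_C + 240.97))
vpd <- e_s * (1 - RH / 100)
round(c(e_s = e_s, vpd = vpd), 2)
```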

Final Synthesis

Re-Check Data

Check that we still have data for every individual.

I can check this by testing whether every individual ID in the final dataset appears in the original data, and vice versa, with %in%; if both logical vectors are all TRUE, no individuals were lost during cleaning.

unique(CEWL_final$individual_ID) %in% unique(all_CEWL_data$individual_ID)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE
unique(all_CEWL_data$individual_ID) %in% unique(CEWL_final$individual_ID)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE

All is as expected. :)
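
Equivalently, setdiff() prints exactly the IDs present in one dataset but not the other, so an empty result in both directions confirms nothing was lost (a sketch):

```r
# IDs lost during cleaning (expect character(0))
setdiff(unique(all_CEWL_data$individual_ID),
        unique(CEWL_final$individual_ID))
# IDs appearing only in the final data (expect character(0))
setdiff(unique(CEWL_final$individual_ID),
        unique(all_CEWL_data$individual_ID))
```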

Check how many observations were used to calculate mean CEWL for each individual on each date:

evs_omitted %>%
  group_by(individual_ID, date) %>%
  summarise(n = n()) %>% 
  arrange(n)
## `summarise()` has grouped output by 'individual_ID'. You can override using the
## `.groups` argument.
## # A tibble: 118 × 3
## # Groups:   individual_ID [80]
##    individual_ID date           n
##    <fct>         <date>     <int>
##  1 F-02          2021-04-23     2
##  2 F-05          2021-04-24     2
##  3 F-06          2021-04-24     2
##  4 F-14          2021-04-24     2
##  5 M-10          2021-04-24     2
##  6 W-013         2021-04-24     2
##  7 W-017         2021-04-24     2
##  8 W-026         2021-04-25     2
##  9 F-01          2021-04-23     3
## 10 F-03          2021-04-23     3
## # … with 108 more rows

Every individual retained between 2 and 5 replicates per measurement date.

Export

Save the cleaned data for models and figures.

write_rds(CEWL_final, "./data/CEWL_dat_all_clean.RDS")
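
Since RDS serialization preserves the R object exactly, a quick round-trip read (a sketch) confirms the export succeeded:

```r
# re-read the exported file and confirm it matches what was written
reloaded <- read_rds("./data/CEWL_dat_all_clean.RDS")
identical(reloaded, CEWL_final)
```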

Reporting

We omitted a total of 114 measurements from our CEWL dataset (465 - 351), typically one replicate per individual. We used the boxplot.stats function in R to identify outliers within each set of technical replicates; 24 points were removed this way (the outliers_found dataframe). We removed an additional 10 extreme replicate values from groups with very large CEWL ranges in which the remaining replicates were closely clustered; these were always sets of 3 replicates, too few for the outlier to be detected statistically. After data cleaning, every individual still had 2-5 technical replicates for each of their measurement dates, and the distribution of coefficient of variation values improved substantially with each cleaning step.